Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target.

Data Dictionary

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: Number of years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIPCode: Home address ZIP code
  • Family: Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage, if any (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
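As a small, hedged illustration (values assumed to follow the dictionary above, not read from the dataset), the integer Education codes can be mapped to readable labels for plots and reports:

```python
import pandas as pd

# Assumed mapping taken from the data dictionary above
education_labels = {1: "Undergrad", 2: "Graduate", 3: "Advanced/Professional"}

codes = pd.Series([1, 2, 3, 1])            # toy stand-in for the Education column
readable = codes.map(education_labels)      # Series.map replaces codes with labels
print(readable.tolist())                    # ['Undergrad', 'Graduate', 'Advanced/Professional', 'Undergrad']
```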

Importing necessary libraries

In [4]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
In [3]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

Loading the dataset

In [5]:
# path to the dataset
data_path = '/content/sample_data/Loan_Modelling.csv'

df = pd.read_csv(data_path)

Data Overview

  • Observations
  • Sanity checks
In [7]:
df.shape
Out[7]:
(5000, 14)
In [6]:
df.head()
Out[6]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
In [9]:
df.describe()
Out[9]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000
mean 2500.500000 45.338400 20.104600 73.774200 93169.257000 2.396400 1.937938 1.881000 56.498800 0.096000 0.104400 0.06040 0.596800 0.294000
std 1443.520003 11.463166 11.467954 46.033729 1759.455086 1.147663 1.747659 0.839869 101.713802 0.294621 0.305809 0.23825 0.490589 0.455637
min 1.000000 23.000000 -3.000000 8.000000 90005.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 1250.750000 35.000000 10.000000 39.000000 91911.000000 1.000000 0.700000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
50% 2500.500000 45.000000 20.000000 64.000000 93437.000000 2.000000 1.500000 2.000000 0.000000 0.000000 0.000000 0.00000 1.000000 0.000000
75% 3750.250000 55.000000 30.000000 98.000000 94608.000000 3.000000 2.500000 3.000000 101.000000 0.000000 0.000000 0.00000 1.000000 1.000000
max 5000.000000 67.000000 43.000000 224.000000 96651.000000 4.000000 10.000000 3.000000 635.000000 1.000000 1.000000 1.00000 1.000000 1.000000

Observations:

  1. There are 5,000 rows and 14 columns in the data.
  2. All columns are integer-typed except CCAvg, which is a float.
  3. There are no null values.
  4. The minimum of Experience is -3; negative years of experience is impossible and points to a data-entry issue worth checking.

Exploratory Data Analysis

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions are listed below to help approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of the Mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?
In [10]:
plt.figure(figsize = (15,5))
sns.boxplot(data = df, x = 'Mortgage');
In [11]:
plt.figure(figsize = (15,5))
sns.histplot(data = df, x = 'Mortgage');
In [12]:
sns.histplot(data=df,x='CCAvg')
plt.show()
In [13]:
sns.boxplot(data=df,x='CCAvg')
plt.show()
In [14]:
sns.histplot(data=df,x='Income')
plt.show()
In [15]:
df['Personal_Loan'].value_counts()
Out[15]:
Personal_Loan
0    4520
1     480
Name: count, dtype: int64
In [16]:
df['CreditCard'].value_counts()
Out[16]:
CreditCard
0    3530
1    1470
Name: count, dtype: int64
In [17]:
df['Online'].value_counts()
Out[17]:
Online
1    2984
0    2016
Name: count, dtype: int64
In [18]:
sns.pairplot(data = df)
Out[18]:
<seaborn.axisgrid.PairGrid at 0x7b2ad82618a0>
In [19]:
sns.jointplot(x='Age', y='Personal_Loan', data=df, kind='hex', gridsize=50, cmap='Blues')
plt.show()
In [20]:
sns.scatterplot(data = df, x = 'Education', y = 'Personal_Loan')
Out[20]:
<Axes: xlabel='Education', ylabel='Personal_Loan'>
In [21]:
sns.scatterplot(data = df, x = 'Securities_Account', y = 'Personal_Loan')
Out[21]:
<Axes: xlabel='Securities_Account', ylabel='Personal_Loan'>
In [22]:
sns.scatterplot(data = df, x = 'Income', y = 'Personal_Loan')
Out[22]:
<Axes: xlabel='Income', ylabel='Personal_Loan'>
In [23]:
sns.scatterplot(data = df, x = 'Experience', y = 'Personal_Loan')
Out[23]:
<Axes: xlabel='Experience', ylabel='Personal_Loan'>

Observations:

  1. The majority (around 3,400 of the 5,000) of customers have a $0 mortgage, with outliers from around $250k upwards; the upper quartile is at $101k.
  2. 1,470 customers have credit cards from other banks.
  3. Age, Experience, Income, CCAvg, and Mortgage appear correlated with Personal_Loan.
  4. A customer's interest in purchasing a loan does have a relationship with age.
  5. A customer's interest in purchasing a loan does have a relationship with education.

Data Preprocessing

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
In [24]:
df.isnull().sum()
Out[24]:
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
In [25]:
df.duplicated().sum()
Out[25]:
0
In [26]:
#Checking the number of unique zip codes
df['ZIPCode'].nunique()
Out[26]:
467
In [27]:
# Bin the zip codes using qcut, converting them to 4 categorical values
df['ZipCode_Binned'] = pd.qcut(df['ZIPCode'], q=4, labels=['1', '2', '3', '4'])
print(df)
        ID  Age  Experience  Income  ZIPCode  Family  CCAvg  Education  \
0        1   25           1      49    91107       4    1.6          1   
1        2   45          19      34    90089       3    1.5          1   
2        3   39          15      11    94720       1    1.0          1   
3        4   35           9     100    94112       1    2.7          2   
4        5   35           8      45    91330       4    1.0          2   
...    ...  ...         ...     ...      ...     ...    ...        ...   
4995  4996   29           3      40    92697       1    1.9          3   
4996  4997   30           4      15    92037       4    0.4          1   
4997  4998   63          39      24    93023       2    0.3          3   
4998  4999   65          40      49    90034       3    0.5          2   
4999  5000   28           4      83    92612       3    0.8          1   

      Mortgage  Personal_Loan  Securities_Account  CD_Account  Online  \
0            0              0                   1           0       0   
1            0              0                   1           0       0   
2            0              0                   0           0       0   
3            0              0                   0           0       0   
4            0              0                   0           0       0   
...        ...            ...                 ...         ...     ...   
4995         0              0                   0           0       1   
4996        85              0                   0           0       1   
4997         0              0                   0           0       0   
4998         0              0                   0           0       1   
4999         0              0                   0           0       1   

      CreditCard ZipCode_Binned  
0              0              1  
1              0              1  
2              0              4  
3              0              3  
4              1              1  
...          ...            ...  
4995           0              2  
4996           0              2  
4997           0              2  
4998           0              1  
4999           1              2  

[5000 rows x 15 columns]
In [28]:
#drop the ID and Zip code columns
df = df.drop(['ID', 'ZIPCode'], axis =1)
print(df)
      Age  Experience  Income  Family  CCAvg  Education  Mortgage  \
0      25           1      49       4    1.6          1         0   
1      45          19      34       3    1.5          1         0   
2      39          15      11       1    1.0          1         0   
3      35           9     100       1    2.7          2         0   
4      35           8      45       4    1.0          2         0   
...   ...         ...     ...     ...    ...        ...       ...   
4995   29           3      40       1    1.9          3         0   
4996   30           4      15       4    0.4          1        85   
4997   63          39      24       2    0.3          3         0   
4998   65          40      49       3    0.5          2         0   
4999   28           4      83       3    0.8          1         0   

      Personal_Loan  Securities_Account  CD_Account  Online  CreditCard  \
0                 0                   1           0       0           0   
1                 0                   1           0       0           0   
2                 0                   0           0       0           0   
3                 0                   0           0       0           0   
4                 0                   0           0       0           1   
...             ...                 ...         ...     ...         ...   
4995              0                   0           0       1           0   
4996              0                   0           0       1           0   
4997              0                   0           0       0           0   
4998              0                   0           0       1           0   
4999              0                   0           0       1           1   

     ZipCode_Binned  
0                 1  
1                 1  
2                 4  
3                 3  
4                 1  
...             ...  
4995              2  
4996              2  
4997              2  
4998              1  
4999              2  

[5000 rows x 13 columns]
In [30]:
#converting the Zipcode binned column to an int
df['ZipCode_Binned'] = df['ZipCode_Binned'].astype(int)

# Function to identify outliers using IQR
def find_outliers_iqr(df):
    outliers = pd.DataFrame()
    for col in df.columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        col_outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
        outliers = pd.concat([outliers, col_outliers], axis=0)
    return outliers.drop_duplicates()

# Identify outliers
outliers_iqr = find_outliers_iqr(df)

print("Outliers detected by IQR method:")
print(outliers_iqr)
Outliers detected by IQR method:
      Age  Experience  Income  Family  CCAvg  Education  Mortgage  \
18     46          21     193       2    8.1          3         0   
47     37          12     194       4    0.2          3       211   
53     50          26     190       3    2.1          3       240   
59     31           5     188       2    4.5          1       455   
303    49          25     195       4    3.0          1       617   
...   ...         ...     ...     ...    ...        ...       ...   
4477   33           9      41       1    1.5          2         0   
4560   43          18      13       2    0.1          2         0   
4629   48          24     148       2    3.3          1         0   
4671   39          14     104       1    4.0          3         0   
4887   41          15      49       3    0.9          3         0   

      Personal_Loan  Securities_Account  CD_Account  Online  CreditCard  \
18                1                   0           0       0           0   
47                1                   1           1       1           1   
53                1                   0           0       1           0   
59                0                   0           0       0           0   
303               1                   0           0       0           0   
...             ...                 ...         ...     ...         ...   
4477              0                   0           1       1           1   
4560              0                   0           1       1           1   
4629              0                   0           1       1           1   
4671              0                   0           1       1           1   
4887              0                   0           1       1           1   

      ZipCode_Binned  
18                 1  
47                 1  
53                 1  
59                 1  
303                4  
...              ...  
4477               2  
4560               4  
4629               1  
4671               4  
4887               1  

[1354 rows x 13 columns]
In [31]:
#calculating the percentage of outliers
percentage_outliers = (len(outliers_iqr) / len(df)) * 100
print(f"Percentage of outliers: {percentage_outliers:.2f}%")
Percentage of outliers: 27.08%

Observations:

  1. There are no null or duplicate values.
  2. There are 467 unique ZIP codes.
  3. 27.08% of rows contain at least one IQR outlier; note that applying the IQR rule to the binary indicator columns inflates this figure, since any 1 in a mostly-0 column is flagged.
  4. We keep the outliers: the high incomes, credit-card spends, and mortgages are genuine values and likely carry signal for loan purchases.
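For completeness, here is a hedged sketch (on toy data, not applied to this dataset) of what IQR-based capping would look like if we did choose to treat outliers. Tree-based models split on thresholds rather than magnitudes, so we leave the real values untouched:

```python
import pandas as pd

# Toy stand-in for a skewed column like Mortgage; NOT the project data
s = pd.Series([0, 0, 50, 100, 101, 635])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Values beyond the 1.5*IQR fences are pulled in to the fence, not dropped
capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(capped.max())   # extreme value reduced to the upper fence
```

This preserves row count (no data loss) at the cost of distorting the tail, which is why we avoid it here.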

Model Building

Model Evaluation Criterion

We are dealing with a binary classification problem: whether or not a customer will buy a personal loan. A decision tree classifier will be used for model building.

Model Building

In [32]:
#split data into features and target (avoiding pop(), which mutates df)
X = df.drop('Personal_Loan', axis=1)
y = df['Personal_Loan']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 5)

#build decision tree model
dtPerLoan = DecisionTreeClassifier(criterion = 'gini', random_state = 5)
dtPerLoan.fit(X_train, y_train)
Out[32]:
DecisionTreeClassifier(random_state=5)
In [33]:
#score the decision tree
print(dtPerLoan.score(X_train, y_train))
print(dtPerLoan.score(X_test, y_test))
1.0
0.974
In [34]:
#checking number of positives
y.sum(axis=0)
Out[34]:
480

Observations

  1. 480 of the 5,000 customers (9.6%) actually accepted the personal loan; y.sum() counts positives in the target, not model predictions, and confirms the class imbalance.
  2. Perfect training accuracy (1.0) versus 0.974 on the test set suggests the unconstrained tree is overfitting.

What does the bank want?

The bank wants to predict which customers would buy personal loans so that they can be targeted with ads. There is little harm if the model flags some customers who may not want the loan. So there are two possible losses here:

  1. Targeting customers who wouldn't buy the loan with ads (false positives)
  2. Missing out on customers who would buy the loan (false negatives)

We can afford the first loss but not the second: we want to capture as many customers who would buy the loan as possible. So we will use recall as the model evaluation metric.
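The trade-off above can be illustrated on toy labels (not the project data): recall measures the share of actual buyers we capture, while precision measures how much ad spend is wasted on non-buyers.

```python
from sklearn.metrics import recall_score, precision_score

# Toy example: 10 actual buyers among 100 customers
y_true = [1] * 10 + [0] * 90
# Model finds 6 of the 10 buyers, and also flags 10 non-buyers
y_pred = [1] * 6 + [0] * 4 + [1] * 10 + [0] * 80

print(recall_score(y_true, y_pred))     # 6/10 = 0.6  -> missed buyers, the loss we can't afford
print(precision_score(y_true, y_pred))  # 6/16 = 0.375 -> wasted ads, the loss we can afford
```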

Model Performance Improvement

In [40]:
def confusion_matrix(model, y_actual):
    '''
    model: fitted classifier used to predict on X_test
    y_actual: ground-truth labels for X_test
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    # build a labeled DataFrame locally (not as an attribute on df)
    cm_df = pd.DataFrame(cm,
                         index=["Actual - No", "Actual - Yes"],
                         columns=["Predicted - No", "Predicted - Yes"])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.array(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm_df, annot=labels, fmt='')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')


def cal_recall_score(model):
    '''
    model: fitted classifier used to predict on X_train and X_test
    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training data set: ", metrics.recall_score(y_train, pred_train))
    print("Recall on test data set: ", metrics.recall_score(y_test, pred_test))

#making confusion matrix for this model
confusion_matrix(dtPerLoan, y_test)
In [41]:
#recall on train and test sets
cal_recall_score(dtPerLoan)
Recall on training data set:  1.0
Recall on test data set:  0.8993288590604027
In [42]:
#visualizing the Decision Tree
feature_names = list(X.columns)
print(feature_names)
['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZipCode_Binned']
In [43]:
plt.figure(figsize=(20,30))
tree.plot_tree(dtPerLoan, feature_names = feature_names, filled = True, fontsize=9, node_ids=True, class_names=True)
plt.show()
In [44]:
#showing the result in text form
print(tree.export_text(dtPerLoan,feature_names = feature_names, show_weights= True))
|--- Income <= 112.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2549.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Age <= 53.00
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- weights: [27.00, 0.00] class: 0
|   |   |   |   |--- Family >  3.50
|   |   |   |   |   |--- Experience <= 13.50
|   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |--- Experience >  13.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Age >  53.00
|   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age >  57.00
|   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 92.50
|   |   |   |   |--- Age <= 29.50
|   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Age >  29.50
|   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |--- Mortgage <= 189.50
|   |   |   |   |   |   |   |--- Income <= 80.50
|   |   |   |   |   |   |   |   |--- Experience <= 13.00
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- CCAvg >  3.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience >  13.00
|   |   |   |   |   |   |   |   |   |--- weights: [19.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  80.50
|   |   |   |   |   |   |   |   |--- Experience <= 25.50
|   |   |   |   |   |   |   |   |   |--- ZipCode_Binned <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ZipCode_Binned >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience >  25.50
|   |   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  189.50
|   |   |   |   |   |   |   |--- Income <= 68.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Income >  68.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.50
|   |   |   |   |   |   |--- ZipCode_Binned <= 3.50
|   |   |   |   |   |   |   |--- weights: [52.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZipCode_Binned >  3.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.75
|   |   |   |   |   |   |   |   |--- Experience <= 23.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience >  23.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 118.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Mortgage >  118.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  3.75
|   |   |   |   |   |   |   |   |--- weights: [14.00, 0.00] class: 0
|   |   |   |--- Income >  92.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 4.50
|   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |--- Income <= 101.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income >  101.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  4.50
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |--- CCAvg <= 4.15
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- CCAvg >  4.15
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- CCAvg <= 3.45
|   |   |   |   |   |   |--- Experience <= 16.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Experience >  16.50
|   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.45
|   |   |   |   |   |   |--- Experience <= 10.00
|   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |--- Experience >  10.00
|   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.85
|   |   |   |   |   |   |   |   |   |--- Age <= 51.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Age >  51.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- CCAvg >  3.85
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 8.00] class: 1
|   |   |--- CD_Account >  0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- weights: [0.00, 12.00] class: 1
|   |   |   |--- CCAvg >  3.95
|   |   |   |   |--- Mortgage <= 81.00
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Mortgage >  81.00
|   |   |   |   |   |--- Income <= 93.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  93.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|--- Income >  112.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [409.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Family >  3.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 43.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- Experience <= 34.50
|   |   |   |   |--- Age <= 54.50
|   |   |   |   |   |--- Mortgage <= 42.00
|   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |--- ZipCode_Binned <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZipCode_Binned >  2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Mortgage >  42.00
|   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |--- Age >  54.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- Experience >  34.50
|   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |--- Income >  114.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- Online >  0.50
|   |   |   |   |   |--- ZipCode_Binned <= 2.50
|   |   |   |   |   |   |--- Experience <= 10.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Experience >  10.50
|   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |--- ZipCode_Binned >  2.50
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 220.00] class: 1

In [46]:
#checking the importance of various features
print(pd.DataFrame(dtPerLoan.feature_importances_, columns = ["importance"], index = X_train.columns).sort_values(by = 'importance', ascending=False))
                    importance
Education             0.397207
Income                0.311411
Family                0.143470
CCAvg                 0.055520
CD_Account            0.027328
Experience            0.020602
Age                   0.019273
Mortgage              0.010252
ZipCode_Binned        0.008824
Online                0.003608
CreditCard            0.002503
Securities_Account    0.000000
In [47]:
#plotting importance as a barchart

imp = dtPerLoan.feature_importances_
indices = np.argsort(imp)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), imp[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Pre-pruning with a max depth of 3

In [48]:
# pre-pruning using a max depth of 3

dtPerLoan1 = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state = 5)
dtPerLoan1.fit(X_train, y_train)

#the confusion matrix for the new model
confusion_matrix(dtPerLoan1, y_test)
In [49]:
#accuracy on dtPerLoan1
print("Accuracy on training set:", dtPerLoan1.score(X_train, y_train))
print("Accuracy on testing set:", dtPerLoan1.score(X_test, y_test))
Accuracy on training set: 0.986
Accuracy on testing set: 0.9773333333333334
In [50]:
#recall on dtPerLoan1 test and train
cal_recall_score(dtPerLoan1)
Recall on training data set:  0.8731117824773413
Recall on test data set:  0.8187919463087249
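The drop in recall at max_depth=3 suggests the depth itself is worth tuning. A hedged sketch on synthetic data (not the project data) of sweeping max_depth and watching held-out recall before settling on one pre-pruned depth:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Synthetic imbalanced data (~10% positives), standing in for the loan data
X_syn, y_syn = make_classification(n_samples=1000, weights=[0.9], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=5)

# Sweep candidate depths; None means grow the tree fully
for depth in [2, 3, 5, 7, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=5).fit(X_tr, y_tr)
    print(depth, round(recall_score(y_te, clf.predict(X_te)), 3))
```

The same idea, applied to the project data, motivates the GridSearchCV section below, which automates this search with cross-validation.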
In [51]:
#visualizing the dtPerLoan1 tree

plt.figure(figsize=(15,10))
tree.plot_tree(dtPerLoan1, feature_names = feature_names, filled = True, fontsize=9, node_ids=True, class_names=True)
plt.show()
In [52]:
#showing the result in text form
print(tree.export_text(dtPerLoan1,feature_names = feature_names, show_weights= True))
|--- Income <= 112.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2549.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- weights: [42.00, 4.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- weights: [150.00, 32.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [3.00, 17.00] class: 1
|--- Income >  112.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [409.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [1.00, 44.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- weights: [12.00, 6.00] class: 0
|   |   |--- Income >  114.50
|   |   |   |--- weights: [3.00, 228.00] class: 1
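The exported rules above can be read off as a plain predicate. A hand-translated sketch of this tree's decision logic (thresholds copied from the printout; purely illustrative, not a substitute for `dtPerLoan1.predict`):

```python
def likely_loan_taker(income, education, family, ccavg, cd_account):
    # thresholds transcribed from the max_depth=3 tree above
    if income > 112.5:
        if education > 1.5:           # Graduate or Advanced/Professional
            return income > 114.5     # 228 of 231 such customers bought the loan
        return family > 2.5           # Undergrads convert only with family size 3+
    # lower incomes convert mainly via high card spend plus a CD account
    return ccavg > 2.95 and cd_account == 1

print(likely_loan_taker(income=150, education=2, family=2, ccavg=1.0, cd_account=0))  # True
print(likely_loan_taker(income=60, education=1, family=4, ccavg=1.0, cd_account=0))   # False
```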

In [53]:
#checking the importance of various features
print(pd.DataFrame(dtPerLoan1.feature_importances_, columns = ["importance"], index = X_train.columns).sort_values(by = 'importance', ascending=False))
                    importance
Education             0.440745
Income                0.336106
Family                0.149539
CCAvg                 0.042009
CD_Account            0.031600
Age                   0.000000
Experience            0.000000
Mortgage              0.000000
Securities_Account    0.000000
Online                0.000000
CreditCard            0.000000
ZipCode_Binned        0.000000
In [54]:
#plotting importance as a barchart

imp = dtPerLoan1.feature_importances_
indices = np.argsort(imp)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), imp[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Pre-pruning using GridSearchCV for hyperparameter tuning

In [55]:
#choosing classifier
estimator = DecisionTreeClassifier(random_state = 5)
parameters = {'max_depth': np.arange(1,10),
              'min_samples_leaf': [1,3,5,9,11,13,17],
              'max_leaf_nodes': [2,3,5,10],
              'min_impurity_decrease': [0.001, 0.01, 0.1]}

#recall is the metric we care about, so score each parameter combination by recall
recall_scorer = metrics.make_scorer(metrics.recall_score)
In [56]:
#run the grid search over the parameter grid with 5-fold cross-validation
grid_obj = GridSearchCV(estimator, parameters, scoring = recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
In [57]:
#set clf to the best parameter combination
estimator = grid_obj.best_estimator_

#fit the best algorithm to the data
estimator.fit(X_train, y_train)
Out[57]:
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
                       min_impurity_decrease=0.001, min_samples_leaf=9,
                       random_state=5)
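After fitting, the grid object exposes the winning combination and its cross-validated score directly. A small self-contained illustration (synthetic data via `make_classification` and a reduced grid are stand-ins here; the notebook's own grid is larger):

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=5)
grid = GridSearchCV(DecisionTreeClassifier(random_state=5),
                    {'max_depth': np.arange(2, 5)},
                    scoring=metrics.make_scorer(metrics.recall_score),
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)   # the winning hyperparameter combination
print(grid.best_score_)    # its mean cross-validated recall
```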
In [58]:
#confusion matrix for the Gridsearch
confusion_matrix(estimator, y_test)
In [59]:
#accuracy on estimator
print("Accuracy on training set:", estimator.score(X_train, y_train))
print("Accuracy on testing set:", estimator.score(X_test, y_test))
Accuracy on training set: 0.988
Accuracy on testing set: 0.9833333333333333
In [60]:
#recall on estimator test and train
cal_recall_score(estimator)
Recall on training data set:  0.918429003021148
Recall on test data set:  0.912751677852349
In [61]:
#visualizing the estimator tree

plt.figure(figsize=(15,10))
tree.plot_tree(estimator, feature_names = feature_names, filled = True, fontsize=9, node_ids=True, class_names=True)
plt.show()
In [62]:
#showing the result in text form
print(tree.export_text(estimator,feature_names = feature_names, show_weights= True))
|--- Income <= 112.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2591.00, 4.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 92.50
|   |   |   |   |--- weights: [107.00, 12.00] class: 0
|   |   |   |--- Income >  92.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [35.00, 5.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- weights: [8.00, 15.00] class: 1
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [3.00, 17.00] class: 1
|--- Income >  112.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [409.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [1.00, 44.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- weights: [12.00, 6.00] class: 0
|   |   |--- Income >  114.50
|   |   |   |--- weights: [3.00, 228.00] class: 1

In [63]:
#checking the importance of various features
print(pd.DataFrame(estimator.feature_importances_, columns = ["importance"], index = X_train.columns).sort_values(by = 'importance', ascending=False))
                    importance
Education             0.446667
Income                0.334944
Family                0.146349
CCAvg                 0.041113
CD_Account            0.030926
Age                   0.000000
Experience            0.000000
Mortgage              0.000000
Securities_Account    0.000000
Online                0.000000
CreditCard            0.000000
ZipCode_Binned        0.000000
In [64]:
#plotting importance as a barchart

imp = estimator.feature_importances_
indices = np.argsort(imp)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), imp[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Post-pruning with cost-complexity pruning (ccp_alpha)

In [65]:
clf = DecisionTreeClassifier(random_state= 5)
path = clf.cost_complexity_pruning_path(X_train, y_train)
#abs() guards against tiny negative alphas that can appear from floating-point error
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities

pd.DataFrame(path)
Out[65]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000262 0.000524
2 0.000273 0.001070
3 0.000278 0.001626
4 0.000278 0.002737
5 0.000279 0.003296
6 0.000330 0.004286
7 0.000343 0.004972
8 0.000381 0.005353
9 0.000381 0.005734
10 0.000416 0.006981
11 0.000429 0.007409
12 0.000429 0.007838
13 0.000429 0.008266
14 0.000436 0.010444
15 0.000445 0.011334
16 0.000445 0.011779
17 0.000457 0.014065
18 0.000476 0.014541
19 0.000490 0.015522
20 0.000544 0.016066
21 0.000567 0.017199
22 0.000576 0.018926
23 0.000997 0.019923
24 0.001712 0.023346
25 0.004077 0.027424
26 0.004680 0.032104
27 0.006222 0.038326
28 0.022147 0.060473
29 0.055391 0.171255
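To unpack the table: each `ccp_alpha` is the complexity-penalty level at which pruning one more subtree becomes worthwhile, and `impurities` is the total leaf impurity of the tree pruned at that level, so the two columns rise together. A toy check of that reading on synthetic data (assumed setup, not the bank data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=5)
path = DecisionTreeClassifier(random_state=5).cost_complexity_pruning_path(X, y)

# alphas come back sorted ascending, and total leaf impurity rises with them
assert list(path.ccp_alphas) == sorted(path.ccp_alphas)
assert list(path.impurities) == sorted(path.impurities)
print(path.ccp_alphas[0], path.impurities[0])  # the path starts at the fully grown tree
```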
In [66]:
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker= 'o', drawstyle = 'steps-post')
ax.set_xlabel('effective alphas')
ax.set_ylabel('total impurities of leaves')
ax.set_title('Total Impurities vs effective alphas for training set')
plt.show()
In [68]:
#training the DT with effective alphas
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state= 5, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

print('Number of nodes in the last tree is: {} with ccp_alpha: {}'.format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.05539116545495793
In [69]:
#drop the last alpha/tree: it prunes everything away, leaving only the root node
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
In [70]:
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax= plt.subplots(2,1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker = 'o', drawstyle = 'steps-post')
ax[0].set_xlabel('alpha')
ax[0].set_ylabel('number of nodes')
ax[0].set_title('Number of nodes vs alphas')
ax[1].plot(ccp_alphas, depth, marker = 'o', drawstyle = 'steps-post')
ax[1].set_xlabel('alpha')
ax[1].set_ylabel('depth of tree')
ax[1].set_title('Depth vs alphas')
fig.tight_layout()
In [71]:
#accuracy vs alpha for training and testing sets

train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel('alpha')
ax.set_ylabel('accuracy')
ax.set_title('Accuracy vs alpha for training and testing sets')
ax.plot(ccp_alphas, train_scores, marker = 'o', label='train', drawstyle = 'steps-post')
ax.plot(ccp_alphas, test_scores, marker = 'o', label='test', drawstyle = 'steps-post')
ax.legend()
plt.show()
In [72]:
#accuracy of the model
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy for best model:', best_model.score(X_train, y_train))
print('Test accuracy for best model:', best_model.score(X_test, y_test))
DecisionTreeClassifier(ccp_alpha=0.0009967221671277216, random_state=5)
Training accuracy for best model: 0.988
Test accuracy for best model: 0.9833333333333333
In [73]:
# but accuracy is not our best metric, recall is, so we'll use recall

recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    values_train= metrics.recall_score(y_train, pred_train3)
    recall_train.append(values_train)

recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    values_test= metrics.recall_score(y_test, pred_test3)
    recall_test.append(values_test)
In [74]:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel('alpha')
ax.set_ylabel('recall')
ax.set_title('Recall vs alpha for training and testing sets')
ax.plot(ccp_alphas, recall_train, marker = 'o', label='train', drawstyle = 'steps-post')
ax.plot(ccp_alphas, recall_test, marker = 'o', label='test', drawstyle = 'steps-post')
ax.legend()
plt.show()
In [75]:
#recall of the model
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0005668737060041407, random_state=5)
In [76]:
#confusion matrix for recall
confusion_matrix(best_model, y_test)
In [77]:
#recall on train and test set
cal_recall_score(best_model)
Recall on training data set:  0.9365558912386707
Recall on test data set:  0.912751677852349
In [78]:
#visualizing
plt.figure(figsize=(15,10))
tree.plot_tree(best_model, feature_names = feature_names, filled = True, fontsize=9, node_ids=True, class_names=True)
plt.show()
In [79]:
#text report showing the rules of the DT
print(tree.export_text(best_model,feature_names = feature_names, show_weights= True))
|--- Income <= 112.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2549.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Age <= 53.00
|   |   |   |   |--- weights: [35.00, 1.00] class: 0
|   |   |   |--- Age >  53.00
|   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age >  57.00
|   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 92.50
|   |   |   |   |--- Age <= 29.50
|   |   |   |   |   |--- weights: [1.00, 3.00] class: 1
|   |   |   |   |--- Age >  29.50
|   |   |   |   |   |--- weights: [106.00, 9.00] class: 0
|   |   |   |--- Income >  92.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [35.00, 5.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- weights: [8.00, 15.00] class: 1
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [3.00, 17.00] class: 1
|--- Income >  112.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [409.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [1.00, 44.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- weights: [12.00, 6.00] class: 0
|   |   |--- Income >  114.50
|   |   |   |--- weights: [3.00, 228.00] class: 1

In [80]:
#checking the importance of various features
print(pd.DataFrame(best_model.feature_importances_, columns = ["importance"], index = X_train.columns).sort_values(by = 'importance', ascending=False))
                    importance
Education             0.438771
Income                0.330291
Family                0.143762
CCAvg                 0.040386
CD_Account            0.030379
Age                   0.016410
Experience            0.000000
Mortgage              0.000000
Securities_Account    0.000000
Online                0.000000
CreditCard            0.000000
ZipCode_Binned        0.000000
In [81]:
#plotting importance as a barchart

imp = best_model.feature_importances_
indices = np.argsort(imp)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), imp[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations:

  1. The first (unpruned) decision tree model has an accuracy of 1 on the training data and 0.974 on the test data, and a recall of 1 on the training data and 0.8993288590604027 on the test data. This is quite good: true positives were 8.93% of the test set, while false positives and false negatives were only 1.60% and 1.00% respectively. According to the model:

     Education is the most important variable. But the tree is quite complex, and the perfect training scores suggest it may overfit.
  2. The second model is pre-pruned with a max depth of 3. Its accuracy is 0.986 on the training data and 0.9773333333333334 on the test data. The recall score is:

     on training data set:  0.8731117824773413
     on test data set:  0.8187919463087249

     Education is still the leading variable, followed by Income.
  3. The third model is pre-pruned with GridSearchCV over the hyperparameters. Its scores are:

     Accuracy on training set: 0.988
     Accuracy on testing set: 0.9833333333333333
     Recall on training data set:  0.918429003021148
     Recall on test data set:  0.912751677852349

     Education still leads as the most important variable, followed by Income.
  4. The fourth model is post-pruned with ccp_alpha. As alpha grows, total leaf impurity also grows. Its scores are:

     Training accuracy for best model: 0.988
     Test accuracy for best model: 0.9833333333333333
     Recall on training data set:  0.9365558912386707
     Recall on test data set:  0.912751677852349

     Education is still the leading important variable.

Model Comparison and Final Model Selection

In [82]:
comparison_frame = pd.DataFrame({'Model': ['Initial decision tree model', 'Decision tree with restricted maximum depth', 'Decision tree with hyperparameter tuning', 'Decision tree with post-pruning'], 'Train_Recall': [1, 0.873, 0.918, 0.936], 'Test_Recall': [0.899, 0.818, 0.912, 0.912]})
comparison_frame
Out[82]:
Model Train_Recall Test_Recall
0 Initial decision tree model 1.000 0.899
1 Decision tree with restricted maximum depth 0.873 0.818
2 Decision tree with hyperparameter tuning 0.918 0.912
3 Decision tree with post-pruning 0.936 0.912
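Rather than typing the recall values in by hand, the same frame can be computed from the fitted models. A self-contained sketch (synthetic data and stand-in models; in the notebook the loop would run over `dtPerLoan`, `dtPerLoan1`, `estimator`, and `best_model`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

# stand-ins for the notebook's four models
models = {'Initial decision tree model': DecisionTreeClassifier(random_state=5),
          'Decision tree with restricted maximum depth': DecisionTreeClassifier(max_depth=3, random_state=5)}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({'Model': name,
                 'Train_Recall': recall_score(y_train, model.predict(X_train)),
                 'Test_Recall': recall_score(y_test, model.predict(X_test))})

comparison_frame = pd.DataFrame(rows)
print(comparison_frame)
```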

Observations

The decision tree with post-pruning and the decision tree with hyperparameter tuning both give the highest recall on the test set (0.912). We'd go with the post-pruned model, since it also achieves the higher recall on the training set (0.936 vs 0.918).

Actionable Insights and Business Recommendations

  • What recommendations would you suggest to the bank?
  1. Education, Income, and average monthly credit-card spending (CCAvg) strongly influence whether a customer will take a loan, so the bank should target customers with high values in these attributes.
  2. The data suggest that only a small percentage of customers are likely to purchase a loan, so the bank could improve its loan incentives to widen that base.